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QUANTIFYING GENE RELATEDNESS VIA NONLINEAR PREDICTION OF 

GENE EXPRESSION LEVELS 

FIELD 

The invention relates to computer analysis of gene relationships. 

BACKGROUND 

The development of complementary DNA ("cDNA") microarray technology has 
provided scientists with a powerful analytical tool for genetic research (M. Schena, D. 
Shalon, R.W. Davis, and P.O. Brown, "Quantitative monitoring of gene expression 
patterns with a complementary DNA microarray," Science, 270 [5235], 467-70, 1995). 
For example, a researcher can compare expression levels in two samples of biological 
material for thousands of genes simultaneously via a single microarray experiment. 
Application of automation to the microarray experiment processes further enhances 
researchers' ability to perform numerous experiments in less time. Consequently, an 
emerging challenge in the field of genetics is finding new and useful ways to analyze 
the large body of data produced by such microarray experiments. 

Overview of Gene Biology 
An organism's genome is present in each of its cells. The genome includes 
genes that hold the information necessary to produce various proteins needed for the 
cell's functions. For example, energy production, biosynthesis of component 
macromolecules, maintenance of cellular architecture, and the ability to act upon intra- 
and extracellular stimuli all depend on proteins. 

Although the number of human genes is estimated to be 30,000 to 100,000, and 
each gene relates to a different protein, only a portion of the possible proteins are 
present ("expressed") in any individual cell. Certain proteins called "housekeeping" 
proteins are likely to be present in all cells. Other proteins that serve specialized 
functions are only present in particular cell types. For example, muscle cells contain 
specialized proteins that form the dense contractile fibers of a muscle. Although the 
exact combination of factors which specify the transcription rate of each particular 
protein at any given moment are often unknown, researchers have a general 
understanding of the mechanics and steps related to protein expression. 
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The expression of genetic information has been summarized in the "central 
dogma/' which postulates that the flow of genetic information in a cell proceeds from 
DNA to RNA to protein. An initial step in the process of converting the genetic 
information stored in an organism's genome into a protein is called "transcription." 
5 Transcription essentially involves production of a corresponding mRNA molecule from 
a particular gene. The mRNA is later translated into an appropriate protein. One way 
to observe activity within the complex control network that is responsible for the 
generation of proteins within a cell is to observe transcript level (i.e., measuring the 
amount of mRNA) for a particular gene. The presence of a transcript from a gene 

1 0 indicates the biological system is taking the first steps to express a protein associated 
with the gene. This phenomenon is sometimes called "gene expression." Although 
complete operation of the mechanisms that control gene expression is not fully 
understood, users can gain insight into how the mechanisms operate by measuring 
expression levels for various genes. 

15 A cDNA microarray experiment can produce data measuring gene expression 

levels for a large number of genes in biological material. One way to analyze these 
expression levels is to identify a pair of genes having a linear expression pattern under a 
variety of conditions. 

For example, if the expression levels (e.g., "x" and "y") of two genes over a 

20 variety of observed conditions are plotted on a two-dimensional graph, it may be the 
case that a linear mathematical relationship (y = mx + b) is evident. To measure the 
linearity of the relationship between two genes, a researcher can calculate the Pearson 
correlation coefficient for points representing expression levels of the genes. A 
correlation coefficient with a high absolute value indicates a high degree of linear 

25 relationship between the two genes (i.e., if plotted in two-dimensional space, points 
representing the expression levels tend to fall in a line). Such genes are said to be 
"coexpressed." Coexpression does not necessarily indicate a cause-effect relationship. 
Still, given the immense volume of data collected for various sets of genes, discovering 
genes with linearly-related expression levels serves a useful purpose by identifying 

30 genes for further experimentation and study. 
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While identifying linearly-related gene pairs has led to various useful 
discoveries, there remains a need for other tools to analyze the wealth of data related to 
observation of gene expression levels. 

SUMMARY 

5 The invention provides methods and systems for analyzing data related to 

observation of gene expression levels. A nonlinear model can be constructed based on 
data comprising expression level observations for a set of genes. The nonlinear model 
predicts gene expression among the set of genes. The effectiveness of the nonlinear 
model in predicting gene expression can then be measured to quantify relatedness for 

10 genes in the set. 

One implementation uses a full-logic multivariate nonlinear model. The 
effectiveness of the model can be tested for a set of predictive elements. The predictive 
elements are inputs that can include gene expression levels and an indication of the 
condition to which observed biological material was subjected (e.g., ionizing radiation). 

1 5 Another implementation uses a neural network-based multivariate nonlinear 

model. For example, a ternary perceptron can be constructed with predictive elements 
as inputs and a predicted expression level for a predicted gene as an output. 
Effectiveness of the perceptron indicates relatedness among the predicted gene and 
genes related to the predictive elements. 

20 Various techniques can be used to construct and measure the effectiveness of 

nonlinear models. For example, the same data can be used to both train and test a 
perceptron. Alternatively, the data can be divided into a training set and a test set. 
Further, the same data can be reused to train by randomly reordering the data. To avoid 
imnecessary computation, data for genes having expression levels not changing more 

25 than a minimum number of times can be ignored under certain circumstances. 

Effectiveness of the nonlinear model can be measured by estimating a 
coefficient of determination. For example, a mean square error between a predicted 
value and a thresholded mean of observed values in a test data set can be calculated for 
the model. 
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Various user interface features allow users to interact with data in a variety of 
ways helpful for finding related genes. For example, a graph of a set of predictive 
elements and a predicted gene can show each predictive element's contribution to the 
effectiveness of predicting expression of the predicted gene via the size of a bar. The 
gene represented by a display element can be denoted by color. Further, redundant 
predictive element sets (e.g., a gene set in which one of the genes does not contribute to 
the effectiveness of the model) can be removed from a presentation of the analysis. 

If the technique indicates a group of genes are related, researchers can then 
further investigate the group if appropriate. 

Additional features and advantages of the invention will be made apparent from 
the following detailed description of illustrated embodiments, which proceeds with 
reference to the accompanying drawings. 

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a diagram illustrating gene expression level data collection via a 
cDNA microarray experiment. 

FIG. 2 is a table containing gene expression level data, such as that collected via 
a number of cDNA microarray experiments. 

FIG. 3 is a diagram illustrating a simple representation of a gene expression 
pathway. 

FIG. 4 is a block diagram representing a genetic system. 

FIG. 5 is a block diagram representing a genetic system and an associated 
multivariate nonlinear model of the genetic system. 

FIG. 6 is a flowchart showing a method for quantifying gene relatedness. 

FIG. 7A is a block diagram showing a multivariate nonlinear predictor. 

FIG. 7B is a table illustrating effectiveness measurement for the multivariate 
nonlinear predictor of FIG. 7A. 

FIG. 8 is a flowchart showing a method for identifying related genes out of a set 
of observed genes. 
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FIG. 9 is a Venn diagram showing a relationship between unconstrained and 
constrained predictors. 

FIG, 10 is a diagram showing a graphical representation of a perceptron 
multivariate nonlinear model for predicting gene expression for a predicted gene. 
5 FIG. 1 1 is a screen capture showing an exemplary user interface for presenting 

effectiveness of predictors constructed to predict gene expression among a variety of 
gene groups. 

FIG. 12 is a flowchart showing a method for analyzing gene expression data to 
identify related genes. 

10 FIG. 13 is a flowchart showing a method for constructing a full-logic model to 

predict gene expression. 

FIG. 14 is a flowchart showing a method for testing the effectiveness of a full- 
logic model in predicting gene expression. 

FIG. 15 is a flowchart showing a diagram of a two-layer neural network. 
1 5 FIG. 1 6 is a flowchart showing a method for training a perceptron to predict 

gene expression. 

FIG. 17 is a flowchart showing a method for testing the effectiveness of a 
perceptron in predicting gene expression. 

FIG. 18 is a graph denoting error of unconstrained and constrained predictors. 
20 FIG. 19 is an exemplary arrow plot of a prediction tree showing coefficient of 

determination calculations. 

FIGS. 20A-20C are arrow plots of prediction trees showing coefficient of 
determination calculations. 

FIGS. 21 A-D are arrow plots of prediction trees for full-logic predictors. 
25 FIGS. 22A-D are arrow plots of prediction trees for perceptron predictors. 

FIGS. 23 A-D are arrow plots of prediction trees comparing full-logic and 
perceptron predictors. 

FIG. 24 is a screen capture of a graph showing increases in the coefficient of 
determination resulting from addition of predictive elements. 
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FIG. 25 is a screen capture showing a user interface presenting plural sets of 
predictive elements for a particular predicted gene. 

FIG. 26 is a screen capture showing a cubical representation of data for a 
particular set of predictive elements and a predicted gene AHA. 
5 FIG. 27 is a screen capture showing a cubical representation of a perceptron for 

a particular set of predictive elements and a predicted gene AHA. 

FIG. 28 is a screen capture showing a logic circuitry representation of a 
perceptron for a particular set of predictive elements and a predicted gene AHA. 

FIG. 29 is a block diagram illustrating a computer system suitable as an 
1 0 operating environment for an implementation of the invention. 

DETAILED DESCRIPTION 
Definitions 

Additional definitions of terms commonly used in molecular genetics can be 

15 found in Benjamin Lewin, Genes F published by Oxford University Press, 1994 (ISBN 
0-19-854287-9); Kendrew et al (eds.). The Encyclopedia of Molecular Biology, 
published by Blackwell Science Ltd., 1994 (ISBN 0-632-02182-9); and Robert A, 
Meyers (ed.). Molecular Biology and Biotechnology: a Comprehensive Desk Reference, 
published by VCH Publishers, Inc., 1995 (ISBN 1-56081-569-8). 

20 Gene relatedness includes genes having any of a variety of relationships, 

including coexpressed genes, coregulated genes, and codetermined genes. The 
mechanism of the relationship need not be a factor in determining relatedness. In the 
network that controls gene expression, a gene may be upstream or downstream from 
others; some may be upstream while others are downstream; or they may be distributed 

25 about the network in such a way that their relationship is based on chains of interaction 
among various intermediate genes or other mechanisms. 

A probe comprises an isolated nucleic acid which, for example, may be attached 
to a detectable label or reporter molecule, or which may hybridize with a labeled 
molecule. For purposes of the present disclosure, the term "probe" includes labeled 

30 RNA from a tissue sample, which specifically hybridizes with DNA molecules on a 



6 



E-059-00/0 06/15/00 4239-54279/WDN/GLM EXPRESS MAIL EM126586308US 

cDNA microarray. However, some of the literature describes microarrays in a different 
way, instead calling the DNA molecules on the array "probes." Typical labels include 
radioactive isotopes, ligands, chemiluminescent agents, and enzymes. Methods for 
labeling and guidance in the choice of labels appropriate for various purposes are 
5 discussed, e.g., in Sambrook et al, in Molecular Cloning: A Laboratory Manual, Cold 
Spring (1989) and Ausubel et al., in Current Protocols in Molecular Biology, Greene 
Publishing Associates and Wiley-Intersciences (1987), 

Hybridization: Oligonucleotides hybridize by hydrogen bonding, which 
includes Watson-Crick, Hoogsteen or reversed Hoogsteen hydrogen bonding between 

10 complementary nucleotide units. For example, adenine and thymine are 

complementary nucleobases which pair through formation of hydrogen bonds. 
"Complementary" refers to sequence complementarity between two nucleotide units. 
For example, if a nucleotide unit at a certain position of an oligonucleotide is capable of 
hydrogen bonding with a nucleotide unit at the same position of a DNA or RNA 

1 5 molecule, then the oligonucleotides are complementary to each other at that position. 
The oligonucleotide and the DNA or RNA are complementary to each other when a 
sufficient number of corresponding positions in each molecule are occupied by 
nucleotide units which can hydrogen bond with each other. 

Specifically hybridizable and complementary are terms which indicate a 

20 sufficient degree of complementarity such that stable and specific binding occurs 

between the oligonucleotide and the DNA or RNA target. An oligonucleotide need not 
be 100% complementary to its target DNA sequence to be specifically hybridizable. 
An oligonucleotide is specifically hybridizable when there is a sufficient degree of 
complementarity to avoid non-specific binding of the oligonucleotide to non-target 

25 sequences under conditions in which specific binding is desired. 

An experimental condition includes any number of conditions to which 
biological material can be subjected, including stimulating a cell line with ionizing 
radiation, ultraviolet radiation, or a chemical mutagen (e.g., methyl methane sulfonate, 
MMS). Experimental conditions can also include, for example, a time element. 
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Gem expression is conversion of genetic information encoded in a gene into 
RNA and protein, by transcription of a gene into RNA and (in the case of protein- 
encoding genes) the subsequent translation of mRNA to produce a protein. Hence, 
expression involves one or both of transcription or translation. Gene expression is 
5 often measured by quantitating the presence of mRNA. 

Gene expression level is any indication of gene expression, such as the level of 
mRNA transcript observed in biological material A gene expression level can be 
indicated comparatively (e.g., up by an amount or down by an amount) and, further, 
may be indicated by a set of discrete values (e.g., up-regulated, unchanged, or down- 
10 regulated). 

A predictive element includes any quantifiable external influence on a system or 
a description of the system's state. In the context of gene expression, a predictive 
element includes observed gene expression levels, applied stimuli, and the status of 
biological material. In addition to those described below, possible external influences 

15 include use of pharmaceutical agents, peptide and protein biologicals, lectins, and the 
like. In addition to mutated or inactivated genes, possible cell states include transgenes, 
epigenetic elements, pathological states (e.g., cancer A or cancer B), developmental 
state, cell type (e.g., cardiomyocyte or hepatocyte), and the like. In some cases, 
combinations of predictive elements can be represented as a single logical predictive 

20 element. 

Multivariate describes a function (e.g., a prediction function) accepting more 
than one input to produce a result. 

Overview of Gene Expression Data 
Although gene expression data has been collected in a variety of ways, advances 
25 in cDNA microarray technology have enabled researchers to increase the volume and 
accuracy of data collection. (Worley et al, "A Systems Approach to Fabricating and 
Analyzing DNA Microarrays," Microarray Biochip Technology, pp. 65-85, Schena, ed., 
Eaton Publishing, 2000). An overview of a cDNA microarray experiment is shown in 
FIG. 1 . A cDNA microarray typically contains many separate target sites to which 
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cDNA associated with respective genes has been appHed. A single microarray may 
contain cDNA associated with thousands of genes. 

During a process called hybridization, a sample with unknown levels of mRNA 
is applied to the microarray. After hybridization, visual inspection of the target site 
5 indicates a gene expression level for the gene associated with the target site. Further 
details concerning cDNA microarray technology can be found in Chen et ah, U.S. 
Patent Application No. 09/407,021, filed on September 28, 1999, entitled "Ratio Based 
Decisions and the Quantitative Analysis of cDNA Micro-Array Images" and Lockhart 
et al., U.S. Patent No. 6,040,138, filed September 15, 1995, entitled "Expression 

10 Monitoring by Hybridization to High Density Oligonucleotide Arrays," both of which 
are hereby incorporated herein by reference. 

In a specific example of the cDNA microarray technique shown in FIG. 1, 
samples of mRNA from two different sources of biological material are labeled with 
different fluorescent dyes (e.g., Cy3 and Cy5). One of the samples (called a "control 

1 5 probe") 1 04 serves as a control and could be, for example, mRNA for a cell or cells of a 
particular cell line. The other sample (called an "experimental probe") 106, could be, 
for example, mRNA from the same cell line as the control probe, but from a cell or 
cells subjected to an experimental condition (e.g., irradiated with ionizing radiation). 
Different fluorescent dyes are used so the two samples can be distinguished 

20 visually (e.g,, by a confocal microscope or digital microscanner). The samples are 
sometimes called "probes" because they probe the target sites. The samples are then 
co-hybridized onto the microarray; mRNA firom the sample (and the associated dye) 
hybridizes (i.e., binds) to the target sites if there is a match between mRNA in a sample 
and the cDNA at a particular target site. A particular target site is associated with a 

25 particular gene. Thus, visual inspection of each target site indicates how much (if any) 
mRNA transcript for the gene was present in the sample. For example, if the control 
probe is labeled with red dye, and a particular target site is associated with gene "A," 
abundant presence of the color red during visual inspection of the target site indicates a 
high gene expression level for gene "A" in the control probe. The intensity of the red 

30 signal can be correlated with a level of hybridized mRNA, which is in turn one measure 
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of gene expression. The process thus can quickly provide gene expression level 
measurements for many genes for each sample. 

The system 1 12 is typically a complex combination of subsystems that process 
the two samples 104 and 106 to quantify the hybridization of the probes to the 

5 microarray (and, thus the amount of mRNA transcript related to the gene in the probe 
and, therefore, the expression level of a gene associated with a target site). For 
example, a system 1 12 could comparatively quantify hybridization to produce results 
such as those shown in expression data 120, As shown, expression data 120 can 
specify expression comparatively (e.g., expression of the experimental probe vis-a-vis 

1 0 the control probe). Ratios of gene-expression levels between the samples are used to 
detect meaningfully different expression levels between the samples for a given gene. 
The phenomenon of observing an increase in the amount of transcript is called "up- 
regulation." A decrease is called "down-regulation." Although many of the details of 
the system 1 12 are not necessary for an understanding of the invention, the system 1 12 

1 5 typically employs a wide variety of techniques to avoid error and normalize 
observations. 

For example, system 1 12 can include a procedure to calibrate the data internally 
to the microarray and statistically determine whether the raw data justify the conclusion 
that expression is up-regulated or down-regulated with 99% confidence. (Chen et aL, 
20 U.S. Patent Application No. 09/407,021, filed on September 28, 1999, entitled "Ratio 
Based Decisions and the Quantitative Analysis of cDNA Micro- Array Images"). The 
system 1 12 can thus avoid error due to experimental variability. 

As microarray technology progresses, the number of genes reliably measurable 
in a single microarray experiment continues to grow, leading to numerous gene 
25 expression observations. Further, automation of microarray experiments has 

empowered researchers to perform a large number of microarray experiments seriatim. 
As a resuh, large blocks of data regarding gene expression levels can be collected, 

FIG. 2 shows gene expression level observations for a set of A: microarray 
experiments. For the experiments, a variety of cell lines have been subjected to a 
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variety of conditions, and gene expression level observations for n genes have been 
collected. 

Although much data relating to gene expression can be collected, a major 
challenge for researchers is analyzing the data to determine v^hether genes in the 
5 experiment are related. As a result of careful analysis of a biological system, it may be 
possible to imderstand a portion of the control network responsible for controlling gene 
expression. These relationships are sometimes called "expression pathways" in the 
genetic network. 

For example, a hypothetical expression pathway for a gene Ggo? is shown in 

1 0 FIG. 3 . In the example, expression of gene Ggo? is the result of an as yet undiscovered 
mechanism in combination with expression of gene Gi7437j which in turn is the result of 
expression of genes G17436 04^7. Discovery of an expression pathway represents a 
breakthrough in understanding of the system controlling gene expression and can 
facilitate a wide variety of advances in the field of genetics and the broader field of 

1 5 medicine. For example, if expression of GgQ7 causes a disease, blocking the pathway 
(e.g., with a drug) may avoid the disease. 

Overview of Multivariate Nonlinear Models for Predicting Gene Expression 
The system 402 responsible for expression of a particular gene can be 
represented as shown in FIG. 4. In the system 402, a set of observable inputs X^-X^ and 

20 a set of unknown inputs U^-U^ operate to produce an observed result, F^bserved i^^ simply 
y). Some of the observable inputs X^-X^ can correspond to gene expression levels. For 
the sake of example, it is assumed that the internal workings of the system 402 are 
beyond current understanding. 

FIG, 5 shows a multivariate nonlinear model 502 for predicting the operation of 

25 the system 402. The model 502 is illustrated as comprising logic, but any multivariate 
nonlinear model can be used. The model 502 takes the observed inputs X^-X^ as inputs 
and provides a predicted output, Tpred* ^ model predicting gene expression, the inputs 
X^'X^ represent a set of predictive elements (e.g., gene expression levels, experimental 
conditions, or both), and 7 represents the expression level of a predicted gene. The 

30 effectiveness of the multivariate nonlinear model 502 in predicting gene expression for 
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the predicted gene can be measured to quantify the relatedness of the predicted gene 
and genes associated with the predictive elements. Measuring the effectiveness of the 
multivariate nonlinear model 502 can be accomplished by comparing Y^^^^ and Yobserw^d 
across a data set. 

5 For example, a simple flowchart showing a method for quantifying gene 

relatedness is shown in FIG. 6. At 602, a multivariate nonlinear model is constructed 
based on data from gene expression level observation experiments. At 608, the 
effectiveness of the multivariate nonlinear model is measured to quantify gene 
relatedness. 608 can comprise predicting gene expression with the multivariate 
10 nonlinear model, then comparing predicted gene expression with observed gene 
expression. 

An exemplary multivariate nonlinear model 704, such as that constructed in 602 
(FIG. 6), is shown at Fig. 7A. Construction of the model 704 is described in more 
detail below. The model 704 takes three inputs (Gj, Gj, and Cj) and produces a single 

15 output (G3). In the example, the output is one of three discrete values chosen from the 
set {down, unchanged, and up}. 

At FIG. 7B, a table 754 illustrates an exemplary measurement of the 
effectiveness of the model 704. Data collected during k microarray experiments is 
applied to the model 704, which produces a predicted value for G3 for each experiment 

20 (G3 p^e^i). In the example, effectiveness of the model 704 is calculated by comparing 
G3 pred with G3 observed ^cross thc data set comprising the experiments. Various methods 
can be used to measure effectiveness in such a technique. For example, the 
effectiveness of the model 704 can be compared with a proposed best alternative 
predictor (e.g., the thresholded mean of G3 observed)- ^he example, comparison with the 

25 thresholded mean of G3 observed provides a value between 0 and 1 called the "coefficient 
of determination." The provided value can be said to estimate the coefficient of 
determination for an optimal predictor. For the model 704, a value of 0.74 is provided 
and quantifies the effectiveness of the model 704 and the relatedness of the genes Gj, 
G2, and G3 (and the condition Cj). In the example, a high value indicates more 
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relatedness than a low value, and the value falls between 0 and 1 . However, any 
number of other conventions can be used (e.g., a percentage or some other rating). 

The above techniques can be expanded for application to a large set of observed 
genes (i.e., genes for which gene expression levels have been observed). For example, 
5 FIG. 8 shows a method for analyzing gene expression data to quantify relatedness 
among a large set of / genes. 

At 802, one of the observed genes is designated as a predicted gene. Typically, 
for each of the predicted genes, 808-820 are repeatedly performed. At 808, a set of 
predictive elements (e.g., gene expression levels, experimental conditions, or both) are 

10 designated for the predicted gene. At 812, a multivariate nonlinear model based on 
data from the experiments is constructed having the predictive elements as inputs and 
the predicted gene as an output. At 816, the effectiveness of the model in predicting 
expression of the predicted gene is measured to quantify relatedness of the predictive 
elements and the predicted gene. 

15 At 820, if there are more possible permutations of predictive elements for the 

predicted gene, the method proceeds to 808. Otherwise, at 832, if there are more 
possible genes to be predicted, the method proceeds to 802, and another predicted gene 
is chosen. Although the method is shown as a loop, it could be performed in some 
other way (e.g., in parallel). 

20 Exemplary data for use with the method of FIG. 8 is shown in Table 1 , where 

"+1," "0," and "-1" represent up-regulation, unchanged, and down-regulation, 
respectively. For the j conditions Cj-Cj, "y" means the experimental sample was 
subjected to the indicated condition, and **n" means the experimental sample was not. 



Table 1 - Gene Expression Data for / Genes over k Experiments 
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Exemplary results of the analysis of the data shown in Table 1 are shown in 
Table 2. In the example, results for groups of three predictive elements are shown. 
Typically, results for groups of other sizes (e.g., two) are also provided. As described 
5 in further detail below, it is also possible to omit results or skip genes under certain 
circumstances. 

According to the results, relatedness of the genes G2, G3, G5, and is 0.71 
(based on a model with G^ as the predicted gene), and relatedness of the genes G2, G3, 
Gi, and condition is 0.04 (based on another model). 
10 Table 2 - Results of Analysis of Gene Expression Data of Table 1 
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Exemplary Multivariate Nonlinear Models Predicting Gene Expression 

A wide variety of multivariate nonlinear models for predicting gene expression 
are possible, and they may be implemented in software or hardware. For example, the 
1 5 form of a full-logic model having three ternary inputs Xj, X2, and X3 and one ternary 
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output 7p,gd is shovm in Table 3. The full-logic model shown takes the form of a ternary 
truth table. Constructing the illustrated full-logic model thus involves choosing values 
for at least some of yi-y27 from the set {-1, 0, +1} to represent gene expression. 
Although the term "full-logic" typically implies that a value will be chosen for each 
5 possible y, it is possible to delay or omit choosing a value for some y's and still arrive at 
useful results. 

The number of inputs can vary and a convention other than ternary can be used. 
If there are m input variables, then the table has 3^ rows and m + 1 columns, and there 
are 3^ possible models. 
1 0 Table 3 - Full-logic Multivariate Nonlinear Model 
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Constrained and Unconstrained Models 

The illustrated full-logic model is sometimes called an "unconstrained" model 
because its form can be used to represent any of the 3^'" theoretically possible models. 
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Under certain circumstances (e.g., when relatively few data points are available), it may 
be advantageous to choose the model from a constrained set of models. 

For example, another possible multivariate nonlinear model is a neural network. 
A ternary perceptron is a neural network with a single neuron. As shown in FIG. 9, 
5 there are fewer possible ternary perceptrons than full-logic models, so ternary 
perceptrons are sometimes called "constrained" models. Any number of other 
constraints can be imposed. For example, a truth table or decision tree may take a 
constrained form. 

In some sense, it is desirable to choose the optimal (i.e., best) model; however, 
10 it is sometimes imknown whether the optimal model can be found within a constrained 
set of models. For example, it may be that the optimal model is model 908, which 
would not be found if the models are constrained to ternary perceptrons. On the other 
hand, the optimal model may be model 912, which can be represented by either a full- 
logic model or a ternary perceptron. Deciding whether to use constrained or 
1 5 unconstrained models typically depends on the sample size, as it can be shown that 
choosing from constrained models can sometimes reduce the error associated with 
estimating the coefficient of determination when the sample size is small. Further 
details concerning the phenomenon are provided in a later section. 
Exemplary Perceptron 
20 A perceptron can be represented by the equation 

Ypred= T{a,X, + a^, + ... + + Z>) (1) 
where Tis a thresholding function, J^, ^2 . . are inputs, Y^^^^ is an output, and 
ai, a2 -'M^ and b are scalars defining a particular perceptron. For a binary perceptron, a 
binary threshold function is used: T{z) = 0 if z < 0, and T{z) = 1 if z > 0. For a ternary 
25 perceptron, a ternary threshold function is used: 

T{z) = -1 if z < -0.5, T{z) - 0 if -0.5 < z < 0.5, and T{z) = +1 if z > 0.5 
FIG. 10 shows one possible graphical representation 1022 of the perceptron 
shown in Equation 1 . Thus, to design the perceptron, a^-a^ and b are chosen based on 
observed gene expression levels. Typically, designing the perceptron involves a 
30 training method, explained in detail below. 
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Exemplary User Interface for Presenting Results 

An exemplary user interface for presenting data similar to that of Table 2 is 
shown in the display region 1 102 of FIG, 1 1 . Data can be limited to a particular 
predicted gene (sometimes called a "target gene") by specifying the gene in box 11 06. 
5 Results pane 1118 displays a set of bars 1 1 50a, 1 1 50b, and 11 50c. Each of the bars 
represents a set of predictive elements used in a particular multivariate nonlinear model 
to predict the predicted gene of box 1 106. Effectiveness of the model is indicated by 
the size of the bar. Further, each predictive element's contribution to the effectiveness 
of the model is indicated by the size of the bar segment representing the predictive 
10 element. 

There are various w^ays to assign a contribution for each element. For example, 
consider a set of three predictive elements having a combined measure of effectiveness. 
A first element with the highest effectiveness when used alone (e.g., in a univariate 
model) is assigned a contribution of the measured effectiveness when used alone (i.e., 

15 the effectiveness of the univariate model). Then, effectiveness of the most effective 
pair of predictive elements including the first element is measured, and the second 
element of the pair is assigned a contribution equal to the effectiveness of the pair 
minus the effectiveness of the first element. Finally, the third element of the pair is 
assigned a contribution to account for any remaining effectiveness of the three 

20 combined elements. 

As shown, the bars are ordered (i.e., the more effective models are shown at the 
top). Additionally, the appearance of each predictive element within the bars can be 
ordered. By default, the predictive elements are ordered within the bars by contribution 
to effectiveness (e.g., genes with larger contribution are placed first). However, a user 

25 may choose to order them so that a particular predictive element appears in a consistent 
location. For example, the user may specify that a particular predictive element appear 
first (e.g., to the left) inside the bar. 

To aid in visual interpretation of the data, a legend 1 160 is presented in which 
each predictive element is shown in a box of a particular color. The bars 1 150a, 1 150b, 

30 and 1 1 50c take advantage of the legend 1 1 60 by representing each predictive element 



17 



E-059-00/0 06/1 5/00 4239-54279/WDN/GLM 



EXPRESS MAIL EM 1265 863 08US 



in the color as indicated in the legend 1 160. As explained in greater detail below, a 
wide variety of alternative or additional user interface features can be incorporated into 
the user interface. 

Details of Exemplary Implementations 

5 A more detailed method for quantifying gene relatedness via multivariate 

nonlinear prediction of gene expression is shown in FIG. 12. The method can use data 
similar to that shown for Table 1 and quantifies gene relatedness by estimating a 
coefficient of determination. For example, an experiment involving a collection of 
microarrays might measure relatedness for 8,000 genes. 

10 At 1202, genes having fewer than a minimum number of changes (e.g., 12%) 

are removed as potential predicted genes. In this way, vmnecessary computation 
regarding genes providing very little information is avoided. 

At 1206, one of the remaining genes is selected as the predicted (or ''target") 
gene. For a given predicted gene, the method computes coefficients of determination 

1 5 for possible combinations of one, two, and three predictive elements. A restriction on 
the number of genes can assist in timely computation of greater numbers of predictive 
elements (e.g., four). Depending on the number of microarrays in the experiments, 
results yielded by excessive predictive elements may not be meaningful. 

At 1208, a set of predictive elements is selected. Then, at 1212, a multivariate 

20 nonlinear model is constructed for the predictive elements and the predicted gene. The 
set of data used to construct the model is sometimes called the "training data set." 

At 1222, the effectiveness of the multivariate nonlinear model is measured to 
quantify relatedness between the predicted gene and genes associated with the 
predictive elements. In the example, a coefficient of determination is calculated. The 

25 data used to measure the model's effectiveness is sometimes called the "test data set." 
The test data set may be separate from, overlap, or be the same as the training data set, 
A variety of techniques can be used to improve estimation of the coefficient of 
determination. For example, multiple attempts can be made using subsets of the data 
and an average taken. The coefficient of determination is stored in a list for later 

30 retrieval and presentation. 
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At 1232, it is determined whether there are more permutations of predictive 
elements for the predicted gene. If so, the method continues at 1208. Otherwise, at 
1242 it is determined whether there are more genes to be designated as a predicted 
gene. If so, the method continues at 1206. Although the method is shown as a loop, it 
5 could be performed in some other way (e.g., in parallel). 

Then, at 1248, redundancies are removed from the list of coefficients of 
determination. For a set of predictive elements, one or more of the predictive elements 
may not contribute to the coefficient of determination. Data relating to these predictive 
elements is removed. For example, if predictive elements "A" and "B" predict " Y" with 
10 a coefficient of determination of 0.77 and "A," "B,^' and "C" predict "Y" at 0.77, the 
data point for "A,'* "B," and "C" is redundant and is thus removed. 

The results (i.e., the list of coefficients of determination) are then presented at 
1252. A variety of user interface features can assist in analysis of the results as 
described in more detail below, 
15 A variety of multivariate nonlinear models for predicting gene expression are 

possible. Full-logic and neural network implementations are described below. 

Full-Logic Predictor 
A method for constructing a full-logic multivariate nonlinear model to predict 
gene expression is illustrated in FIG. 13. The model can be built using data such as that 
20 shown in Table 1. Typically, the inputs and output of the model are chosen, then the 
model is built based on those inputs and outputs. For example, the inputs might be an 
expression level for gene "A," and expression level for gene "B," and an experimental 
condition "C." The output is a prediction of the expression level for gene " Y." 

At 1302, a next observed permutation is selected, starting with the first. It may 
25 be that all possible permutations of the three inputs appear in the data (e.g., all possible 
arrangements of "-1," "0," and "+1" are present for inputs "A," "B," and "C"). 
However, some permutations may not appear. 

At 1304, a predicted value is chosen for the permutation. In some cases, a 
single result (e.g., "-1 ") will be observed for all instances of a particular permutation. 
30 In such a case, the predicted value is the result (i.e., "-1 "). In other cases, there may be 
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two results observed. In such a case, the resuh with the most instances is chosen. In 
still other cases, there may be more than two results observed. In such a case, a 
thresholded weighted mean is chosen as the predicted value. 

If there are more observed permutations, at 1308, the method proceeds to 1302. 
5 After completion, the full-logic model has been built. The full-logic value can then 
supply a predicted value given a particular permutation of predictive elements. 

A method for testing a full-logic model to measure the coefficient of 
determination is shown in FIG. 14. For each data point in the testing set (e.g., a row of 
data), the data point applied to the model (i.e., submitted as predictive elements) to 
10 produce a predicted result at 1402. If there are more data points, at 1404, the method 
proceeds to 1402. 

Then, at 1406, the predicted results are compared with the observed results to 
measure effectiveness of the model. A coefficient of determination (9) is estimated 
using the following equation: 



where n is the number of data points submitted as a test data set, T is a thresholding 
function, and |a,Y is the mean of observed values of Y. is the observed Y at a 
particular data point in the test data, and Yp^^a^ is the corresponding value predicted for 

Y by the full-logic model at the same data point (via the predictive elements for the 
20 data point). 

Note that the predicted results can be stored for comparison at the end of the 
method or compared as the method proceeds. Other approaches can be used to achieve 
similar results. 



25 undefined for a set of inputs (e.g., the permutation of input variables was not observed 
in the training data set). In such a case, a variety of approaches are possible, and 
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(2) 



In some instances, it may be that the prediction for the full-logic model is 



20 



E-059-00/0 0671 5/00 4239-54279AVDN/GLM 



EXPRESS MAIL EM126586308US 



various rules exist for defining output. For example, a technique called "nearest 
neighbor" can be employed to pick a value, the value may be defined as an average of 
neighbors, or the value may simply be defined as "0." 

Neural Network Predictor 
5 Neural networks offer a multivariate nonlinear model which can be trained in 

order to construct an artificial intelligence fijnction. Constraint can be increased or 
decreased based on the form of the network. A wide variety of neural network forms 
are possible. A two-layer neural network is of the form 

r m 

Ypred=g2^^ ^jS^iT^^k^k)) (3) 
j=0 k^O 

10 where Xf^ are the input variables (Xq =1), and and g2 are the activation 

functions for the first and second layers, respectively. A diagram of a two-layer neural 
network is shown in FIG. 15. There are a number of training algorithms available for 
neural networks, one of which is explained below. 

Training a Perceptron to Predict Gene Expression Levels 

15 A perceptron is sometimes called a neural network with a single neuron. 

FIG. 16 illustrates a method for training a perceptron. The perceptron is defined by the 
coefficient vector A = (a^, . . ., b) and its output is determined as shown in 
Equation 1, above. Alternatively, a perceptron is sometimes described as defined by 
the coefficient vector A = {cIq, a^, . . . , and having an additional input Xq, which is 

20 a constant scalar (e.g., 1). 

Training is achieved via a training sequence of data of predictive elements and 
observed values (X^ Y^), (X^, Y^) . . . (X", Y"). A variety of techniques can be used to 
extend a small training set (e.g., the values can be recycled via plural random 
orderings). In summary, the training set is applied to the perceptron. If the perceptron 

25 is in error, the coefficient vector is adjusted. 

At 1604, the perceptron is initialized. Initialization can be either random or 
based on training data. For example, A can be initialized to the cross-covariance vector 
C between X and Y and the autocorrelation matrix R of V = (1, Xj, Jl25 • • ^m)- A is 
thus initialized to R^C, where R^ is the pseudoinverse of R. 
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At 1608, a data point in the training data is applied to the perceptron (e.g., a set 
of observed values for predictive elements are applied and a predicted value is 
generated based on A). 

A variety of training techniques can be used. Generally, a training technique is 
5 characterized by the update 

^(new) ^ ^(old) + 

where AA is called a "transition," 

At 1612, error is evaluated by comparing the observed value w^ith the predicted 
value. For example, the training factor can be defined by 

10 f = (A^^^^^^V-Y) (5) 

where V is the vector of predictive inputs and 7 is an observed value. The training 
error is then 



e = f 



Y 



pred 



(6) 

If e is greater than zero, typically the training method attempts to decrease e by 
1 5 appropriately transitioning A. If e is less than zero, the training method attempts to 

increase e by appropriately transitioning A. An e value of zero implies there is no error, 
so A is not transitioned. The sign and magnitude of the transition can be determined by 

/ 

An appropriate transition can thus be defined as follows: 
20 A4i = -/a/(7-7,,e,)X/ (7) 

where the values a/ are gain factors (0< a/ <1). The gain factors can be advantageously 
selected for proper training under certain circumstances. Other arrangements can be 
used (e.g., substituting e for /in the transition or calculating the values in some other 
order or notation to reach a similar result). 
25 At 1614 if e is zero, A4/ = 0 and no transition is necessary, so the method 

continues at 1632. Otherwise, A is transitioned by applying AA as explained above. 

At 1632, it is determined whether training is finished. If not, the method 
continues at 1608. Any number of stopping criteria can be used, such as testing 
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whether A converges, a fixed number of iterations complete, or a minimum error 
tolerance is reached. 

Alternatively, rather than transitioning A after each data point in the training 
data set, A can be transitioned after the completion of a cycle of the training data set. 
5 For such a technique, e and A4/ can be computed for each data point as above; 

hov^ever, A is updated at the end of each cycle, with AA being the sum of the stepwise 
increments during the cycle. Similar gain and stopping criteria can be applied. 

Testing the Trained Perceptron to Quantify Relatedness 
After training is completed, the perceptron can be tested to quantify relatedness 
1 0 of the predictive elements and the predicted gene. A method for testing a perceptron is 
shown in FIG. 17. At 1702, a data point fi-om the testing data set is applied to the 
perceptron (i.e., the predictive elements are submitted as inputs to the perceptron) to 
provide a predicted result, Ypred- Then, at 1704 the predicted result is compared with 
an observed result (e.g., from an experiment observing gene expression). At 1708, if 
1 5 there are more data points, the method continues at 1 704. Otherwise, at 1 712, a 
measure of the effectiveness of the perceptron is provided. 

Other approaches can be used to provide similar results. For example, the 
predicted results can be stored and compared as a group to the observed results. An 
exemplary measure of the effectiveness of the perceptron is a coefficient of 
20 determination, which is found by applying Equation 2, above. 

Exemplary Combination of Training and Testing of a Perceptron 
The above techniques can be applied in a variety of ways. For example, for a 
set of 30 data points observed, a set of 20 training data sets can be generated (e.g., by 
randomly reordering, choosing a subset of the observations, or both). Then, for each of 
25 the 20 sets of training data, a perceptron is initiahzed to R^C and trained on the data. 
The better of the trained perceptron and the initialized perceptron is then applied to 10 
test data sets (e.g., subsets of the observation data that may or may not overlap with the 
training data) to obtain a measure of the coefficient of determination. 

The entire process above can be repeated a large number of times (e.g., 256), 
30 and the average of the coefficients of determination is considered to be a quantification 
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of relatedness. In some cases, the perceptron may perform more poorly than a baseline 
predictor (e.g., the thresholded mean). If so, measurement of the perceptron is 
considered to be zero. 

Moreover, if it is the case that the coefficient of determination for a perceptron 
5 is greater when one of the variables is removed, the coefficient of determination is 
considered to be the greater of the two measurements. 

Unconstrained and Constrained Predictors 
The theoretically optimal predictor of the target 7 is typically unknown and is 
statistically estimated. The theoretically optimal predictor has minimum error across 

10 the population and is designed (estimated) from a sample by a training (estimation) 

method. The degree to which a designed predictor approximates the optimal predictor 
depends on the training procedure and the sample size n. Even for a relatively small 
number of predictive elements, precise design of the optimal predictor typically 
involves a large number samples. The error, s„, of a designed estimate of the optimal 

1 5 predictor exceeds the error, s^^p^, of the optimal predictor. For a large number of 

experiments, s„ approximates e^pt? but for the small numbers sometimes used in practice, 
s„ may substantially exceed e^^^. Although gene expression levels for many genes are 
observed during a typical single microarray experiment, currently only a few samples 
of biological material are evaluated per microarray. 

20 As the number of system inputs grows, the amount of replicated observations 

necessary for precise statistical design of an optimal predictor grows more rapidly. 
Since a designed predictor depends on a randomly drawn sample data set, we use 
expectations for statistical analysis. Hence, we are concerned with the difference, 
E[z^ - e^pt, between the expected error of the designed predictor and the error of the 

25 optimal predictor. A small difference means that E[e^ provides a good approximation 

to Sopf 

It is sometimes useful to estimate the best predictor from a constrained set of 
predictors. Since the optimal constrained predictor is chosen from a subset of the 
possible predictors, its theoretical error exceeds that of the best predictor; however, the 
30 best constrained predictor can be designed more precisely from the data. The error. 
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^con,n? ^ estimate of the optimal constrained predictor exceeds the error, eopt_con,? the 
optimal constrained predictor. We are concerned with the difference, E[s^^^ J- ^opt-con- 

If we let 8, = £[8j - s^p^ and 5,,„ „ - E[s,^^ J - e,p,.,^„, then the challenges of 
finding good predictors of gene expression levels are evidently the following: 
5 e^pt < 8opt-con? 5„ > 6^on,n'^ E^^^ Is dccrcascd by using more predictive elements, but 5^ is 
thereby increased; the stronger the constraint, the more d^^^ ^ is reduced, but at the cost 
of increasing s^p^.^^^. 

5„ and S^^^ „ are the costs of design in the unconstrained and constrained settings, 
respectively. If we have access to an \mlimited number of experiments (and the design 
1 0 procedures do not themselves introduce error), then we could make both 6„ and 5^^^^^ 
arbitrarily small and have 

However, with a low number of experiments, 5^^ can significantly exceed 8^^^ ,^, Thus, 
the error of the designed constrained predictor can be smaller than that of the designed 

1 5 unconstrained predictor. 

FIG. 1 8 illustrates the phenomenon by comparing error of unconstrained and 
constrained predictors. The axes correspond to sample size n and error. The horizontal 
dashed and solid lines represent Sopt £^pt_con, respectively. The decreasing dashed and 
solid lines represent £[s„] and £[£con,n]? respectively. Ifn is sufficiently large (e.g., N2), 

20 then £[sj < E[e^^^ J; however, if n is sufficiently small (e.g., N^), then £[e„] > £[£conJ' 
The point Nq at which the decreasing lines cross is the cut-off: for n > No, the 
constraint is detrimental; for n < Nq, the constraint is beneficial. When n < Nq, the 
advantage of the constraint is measured by the difference between the decreasing solid 
and dashed lines. 

25 Even if a designed constrained predictor does not perform well, the truly 

optimal constrained predictor may still perform well. Moreover, a less constrained 
predictor might provide good prediction had we a sufficient number of experiments to 
design it. A constraint will typically err by missing a relationship, not erroneously 
indicating a strong relationship, thereby avoiding falsely attributing a predictive 

30 relation where none exists. The relationships missed depend on the constraint. 
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Sometimes a system can be modeled in such a way that a constraint can be derived that 
does not yield increased error; however, this is not typical in nonlinear settings. 

Training and Testing Data Sets 
In a mathematical sense, gene relatedness can be quantified by a measure of the 
5 coefficient of determination of an optimal predictor for a set of variables under ideal 
observation conditions, where the coefficient of determination is the relative decrease in 
error owing to the presence of the observed variables, 

0opt = (s.-Sop,)/£* (9) 
where s, is the error for the best predictor in the absence of observations. Since 
10 e^pt < s„ 0 < 9opt < 1. A similar definition applies for constrained predictors. So long as 
the constraint allows all constant predictors, 0 < Q^p^.^on ^ 1 * 
For the unconstrained ternary predictor, 

eopt = (v^opt)/s (10) 

where is the mean square error from predicting Fby applying T([Iy)^ the ternary 
1 5 threshold of the mean of Y. For constrained predictors, e^p^ is replaced by Sopt con to 
obtain Q,^,^,^,, and Q,^,_,,^ < Q,^,. 

For designed predictors, in Equation 10, is replaced by ^ to give 

e„ = (s,,„-8,)/s,,„ (11) 

£[9^], the expected sample coefficient of determination, is found by taking expected 
20 values on both sides of Equation 1 1 . E[e2 > s^pt? typically E[eJ > s^p^, where the 
inequality can be substantial for small samples. Unless n is quite small, it is not 
unreasonable to assume that „ precisely estimates 8^, since estimation of [Xy does not 
require a large sample. Under this assumption, if we set e^^^ = in Equation 1 1 and 
take expectations, we obtain 
25 £[GJ^(s,,„^-£[8j)/8,, (12) 

Because E[sJ > s^pt. Equations 10 and 12 yield £[9 J < G^p^ and 9^ is a low- 
biased estimator of G^jp^. 

For a constrained optimization, is replaced by s^^^ ^, to obtain 9^0^ In analogy 
30 to Equation 8, if there is a sufficient number of experiments, then 
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6con,n ^ ©opt-con < ^opt ( 1 

As the number of samples increases, the approximations get better. In a low sample 
environment, it is not uncommon to have E[Q^^^ 2 > E[Q2* 

Data is used to estimate 9^ and design predictors. A limited sample size 
5 presents a challenge. For unconstrained predictors (and analogously for constrained 

predictors), one can use the resubstitution estimate, Oyi . For resubstitution, data is 
uniformly treated as training data to train (or design) the best predictor, estimates of 
and are obtained by applying the thresholded estimated mean and the designed 

predictor to the training data, and 6yi is then computed by putting these estimates into 

1 0 Equation 1 1 . In other words, the same data can be used to train and test. 6yi estimates 
9n and thereby serves as an estimator of Q^^^. The resubstitution estimate can be 
expected to be optimistic, meaning it is biased high. 

A different approach is to split the data into training and test data, thereby 
producing cross-validation. A predictor is designed from the training data, estimates of 

1 5 J, and c„ are obtained from the training data, and an estimate of is computed by 
putting the error estimates into Equation 1 1 . Since this error depends on the split, the 
procedure is repeated a number of times (e.g., with different splits) and an estimate, 

6yi , is obtained by averaging. For random sampling, the estimates of and are 
unbiased, and therefore the quotient of Equation 1 1 will be essentially (close to being) 
20 imbiased as an estimator of Q^. Since 9^ is a pessimistic (low-biased) estimator of Q^^^, 

Oyi is a pessimistic estimator of 

Another issue is the number of predictor variables. For m and r predictor 
variables, m<r, iiz^J^ denotes the error for the m- variable predictor, then 
£opt(r) < £opt(^)- Prediction error decreases with an increasing number of variables. 
25 Hence, ^^xiSf) —^opt(^)^ However, with an increasing number of variables comes an 

increase in the cost of estimation (the difference between the errors of the designed and 
optimal predictors.) Intuitively, the information added by expanding the number of 
predictor variables becomes ever more redundant, thereby lessening the incremental 
predictive capability being added, whereas the inherent statistical variability in the new 
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variables increases the cost (error) of design. Letting dj^m) = E[e^(m)] - £opt(^)? 
have 8„(m) < 5„(r), and it may happen that sjr)> sjm) and 0„(r)< 9„(m). Since 

OoptW - ^opxO^X we choose the maximum between O^ir) and ^n(^) c)ur estimator of 

5 Cross validation is beneficial because 0 ^ gives a conservative estimate of B^p^ . 

Thus, we do not obtain an overly optimistic view of the determination. On the other 
hand, training and testing with the same data provides a large computational savings. 
This is important when searching over large combinations of predictor and target genes. 
Under the current scenarios, the goal is typically to compare coefficients of 

10 determination to find sets that appear promising. In one case, comparison is between 
high-biased values; in another, comparison is between low-biased values. 

Exemplary Gene Expression Data Analysis to Quantify Gene Relatedness 
The following example demonstrates the ability of the full-logic and perceptron 
predictors to detect relationships based on changes of transcript level in response to 

15 genotoxic stresses. 

As a result of a microarray study surveying transcription of 1238 genes during 
the response of a myeloid line to ionizing radiation {see Amundson et ah, "Fluorescent 
cDNA Microarray Hybridization Reveals Complexity and Heterogeneity of Cellular 
Genotoxic Stress Responses," Oncogene 18, 3666-3672, 1999), 30 genes not previously 

20 known to participate in response to ionizing radiation ("IR") were found to be so 

responsive. The responsiveness of a subset of nine of these genes was examined by 
blot assays in 12 cell lines stimulated with IR, a chemical mutagen methyl methane 
sulfonate ("MMS"), or ultraviolet radiation ("UV"). The cell lines were chosen so a 
sampling of both p53 proficient and p53 deficient cells would be assayed. 

25 Table 4 shows data indicating observed gene expression levels for the various 

genes. The condition variables IR, MMS, and UV have values 1 or 0, depending on 
whether the condition is or is not in effect. Gene expression is based on comparisons to 
the same cell line not exposed to the experimental condition. -1 means expression was 
observed to go down, relative to untreated; 0 means unchanged; and +1 means 

30 expression was observed to go up. 
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To validate the analysis, data for two fictitious genes (*'AHA" and "OHO") were 
generated. Gene expression rules for the fictitious genes were created, making 
expression of the fictitious genes dependent on other gene expression levels in the set. 
However, the concordance of the generated data with the rules was varied (i.e., the rules 
5 were not strictly adhered to). 

For gene AHA, a rule states AHA is up-regulated if p53 is up-regulated, but 
down-regulated if RCHl and p53 are down-regulated. Full concordance with the rule 
would have produced 1 5 instances of up-regulation and 5 instances of down-regulation; 
the data generated includes 13 of the 15 up-regulations and all of the down-regulations. 

10 For gene OHO, a rule states OHO is up-regulated if MDM2 is up-regulated and 

RCHl is down-regulated; OHO is down-regulated if p53 is down-regulated and REL-B 
is up-regulated. Full concordance with this rule would have produced 4 up-regulations 
and 5 down-regulations. The data generated has the 4 up-regulations, plus 7 others, and 
only 2 of the 5 down-regulations. 

1 5 The data show the genes in the survey are not uniformly regulated in the various 

cell types. The genes showed up- or down-regulation in at least one cell type; however, 
the number of changes observed across the lines varies. Such a varied response reflects 
the different ways in which different cells respond to the same external stimuli based on 
their own internal states. The data are therefore a useful test of the predictors. Since 

20 predictors operate by rules relating changes in one gene with changes in others, 

predictors were not created to predict expression of genes exhibiting fewer than four 
changes in the set of 30 observations. MBPl and SSAT were thus eliminated as 
predicted genes. 

For the fictitious genes, the designed predictors identified both the p53 and 
25 RCHl components of the transcription rule set for AHA. For instance, using the 

perceptron and splitting the training data set fi*om the test data set, a 0.785 estimate of 
the coefficient of determination was given. This value is biased low because splitting 
the data results in cross-validation. Since many rule violations were introduced into the 
data set for the OHO gene, it was expected that the coefficient of determination would 
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not be high when using the predictors MDM2, RCHl, p53, and REL-B; this 
expectation was met. 



Table 4 - Gene Expression Observations 
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The coefficient of determination calculations can be presented as arrow plots of 
prediction trees, as shown in FIG. 19. The predicted gene (sometimes called a "target" 
gene) is shown to the right and the chained predictors to the left. The coefficient of 
5 determination that resuhs when adjoining a predictive element is placed to the right of 
the predictive element. For example, in FIG. 19, Predictor 1 achieves a coefficient of 
determination of Oj. Using Predictors 1 and 2 together achieves 02, and using Predictors 
1,2, and 3 together achieves 63. 

The analysis agrees with expectations provided by existing biological 

1 0 information. For example, it is expected that MDM2 is incompletely, but strongly 
predicted by p53. As shown in FIG. 20A, this expectation was met. Additions of 
further genes to p53 do not increase the accuracy of the prediction. Similarly, since it is 
known that p53 is influential, but not determinative of the up-regulation of both p21 
and MDM2, some level of prediction of p53 should be possible by a combination of 

15 these two genes. As shown in FIG. 20B, this expectation was also met. Moreover, as 
p21 shows both p53 dependent and p53 independent regulation in response to genomic 
damage (see Gorospe et aL, "Up-regulation and Function Role of p21 Wafl/Cipl During 
Growth Arrest of Human Breast Carcinoma MCF-7 Cells by Pheny acetate," Cell 
Growth Differ 7, 1609-15, 1996), it was expected that the p53 component would not be 

20 recognized by the analysis. As shown in FIG. 20C, two other genes (MDM2 and 
ATF3) were chosen over p53 as better predictive elements. 

Among the newly-found IR responsive genes (FRAl, ATF3, REL-B, RCHl, 
PCI, IAP-1, and MBP-I), a set of relationships is seen that appears to link the 
behaviors of REL-B, RCHl, PC-1, MBP-1, BCL2, IAP-1, and SSAT. Sets of full-logic 

25 prediction trees for REL-B, PC-1, RCHl, and IAP-1 are shown in FIGS. 21 A-D; 

perceptron versions appear at FIGS. 22 A-D. Both a full-logic and perceptron analyses 
find a variety of apparently significant similarities of expression behavior within this 
set. Some of these genes also show a high degree of predictability based solely on 
exposure to ionizing radiation. When these genes are viewed with an eye to IR 

30 responsiveness, it becomes apparent they share an overall trend to show expression 
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level changes in response to IR rather than to UV or MMS. Even though MBPl and 
SSAT only responded to IR at the very low rate of 17% of the possible changes, they 
responded only to this stimulus and not to the other genotoxic stimuli, and were thus 
associated with other genes showing similar response. This example illustrates the 
5 power of the analyses to find new, potentially significant relations. 

Using data splitting, the error relation between the full-logic and perceptron 
predictors is illustrated. Using cross-validation, the estimated coefficients of 
determination of the best full-logic and perceptron predictors of BCL3 in terms of lAP- 
1, PC-1 and SSAT are 0.334 and 0.461, respectively. Since the estimate by the 

10 perceptron exceeds that by the full-logic version, the optimal full-logic predictor has a 
coefficient of determination greater than 0.461 . This is true because the error of the 
optimal predictor is less than or equal to that of a constrained predictor. In fact, it could 
be that the optimal full-logic predictor is a perceptron, but this is not determined from 
the limited data available, 

1 5 For prediction of BCL3 in terms of RCHl , SSAT, and p2 1 , the best full-logic 

and perceptron predictors (using cross validation) estimate the coefficient of 
determination at 0.652 and 0.43 1, respectively. Constraining prediction to a perceptron 
underestimates the degree to which BCL3 can be predicted by RCHl, SSAT, and p21 . 
In fact, the true coefficient of determination for the optimal full-logic predictor is likely 

20 to be significantly greater than 0.652, Moreover, performance of the optimal full-logic 
predictor almost surely exceeds that of the optimal perceptron by more than the 
differential 0.221, but this cannot be quantified from the data. However, it can be 
concluded with confidence that the substantial superior performance of the full-logic 
predictor shows the relation between RCHl, SSAT, and p21 (as predictive elements) 

25 and BCL3 (as predicted gene) is strongly nonlinear. 

The different ways in which full-logic and perceptron predictors operate to find 
relationships can be illustrated by examining the prediction each makes for the target 
gene BCL3. The prediction trees of FIGS. 23 A and 23B illustrate the optimal gene sets 
chosen by a perceptron method and full-logic method, respectively. FIG. 23C shows 

30 the perceptron-based method's quantification for the genes selected by the full-logic 
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version, and FIG. 23D shows the full-logic method's quantification for the genes 
selected by the perceptron version. Each approach imposes different computation 
constraints on input measurements. 

Exemplary User Interface 
5 Various user interface features are provided to enable users to more easily 

navigate and analyze the data generated via the techniques described above. 

For example, the results of an analysis can be computed and stored into tables 
(e.g., one for each predicted gene) for access by a user. So that a user can more easily 
peruse the tables, various table-manipulating features are provided. 
10 A user can specify a range n-m for a table, and the sets of predictive elements 

ranked (based on coefficient of determination) n through m are shovm, and the user can 
scroll through the results. A gene can be specified as "required," in which case only 
sets of predictive elements containing the specified gene are displayed. 

One can limit the number of times a gene or gene set appears in the table to omit 
1 5 combinations containing a particular gene. This can be useful, for example, because 
adding further genes to the set simply increases the coefficient of determination. 

One can specify an incremental threshold. Then, only those predictor sets for 
which there is an increase greater than or equal to the incremental threshold when a 
gene has been added are shown. This can be useful, for example, in identifying 
20 situations in which adjoining a gene to a gene set yields a significant increase in the 
coefficient of determination. 

One can delete from a table any gene with an expression level that did not 
change some minimal number of times across the experiments, and the user can issue a 
query for data regarding a specific target and predictor set. 
25 In addition, the results can be reviewed as part of a helpful graphical user 

interface. For example, for any set of predictive elements and a predicted gene, a graph 
can be plotted to show the increase in coefficient of determination as the set of 
predictive elements grows. The order of inclusion of the predictive elements can be 
specified, or the graph can automatically display the best single predictive element first, 



33 



E-059-00/0 06/1 5/00 4239-54279/WDN/GLM 



EXPRESS MAIL EM126586308US 



and given that element, the next best two are shown next, and so forth. An example for 
gene AHA based on the data of Table 4 is shown in Fig. 24. 

In addition to showing the performance of a single set of predictive elements for 
a predicted gene, the software allows visualization for plural sets for a given predicted 
5 gene or for more than one predicted gene. 

FIG, 25 shows an exemplary user interface for presenting plural sets of 
predictive elements for a particular predicted gene, ATF3. Typically, redundancies are 
automatically removed. In an actual analysis of the data shown in Table 4, 80 sets of 
predictive elements were found after redundancies were removed. For the sake of 
10 brevity, FIG, 25 shows only some of the predictive elements in the form presented by a 
user interface. 

Within the display region 2502, bars 2508 representing predictive elements are 
shown. A legend 2512 denotes a color for the predictive elements and predicted genes 
shown. To assist in interpretation, colors in the legend are typically chosen in order 

15 along the frequency of visible light (i.e., the colors appear as a rainbow). Further, the 
bars are ordered by size (e.g., longer bars are at the top). As a result, genes most likely 
to be related to the target gene are presented at the top. 

For example, bar 2522 represents a coefficient of determination value slightly 
less than 0.75. The predicted gene and contributors to the coefficient are denoted by 

20 bar segments 2522A-2522D. 2522A is of fixed length and is of a color representing the 
predicted gene (ATF3). The bar segment 2522B is of a length and appropriate color 
indicating contribution by a predictive element (PC-1). The bar segments 2522C-D 
similarly indicate contributions by IR and FRAl, respectively, and are also of a color 
according to the legend 2512. 

25 The user can specify which bar segments are to be displayed via the controls 

2532. By checking the predictors checkbox 2542, the display is limited to those bar 
segments having the predictors listed in the box 2544. Predictors can be added and 
deleted from the box 2544. 

The user can limit the display to a particular predicted gene specified in the 

30 target gene box 2552 or display bars for all targets by clicking the "all targets" 
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checkbox 2554. For the sake of completeness, conditions such as IR are included as 
targets in the display. 

Further, the user can specify a threshold by checking the threshold checkbox 
2562 and specifying a value in the threshold box 2564. The threshold slider 2566 
5 allows another way to specify or modify a threshold. Specifying a threshold limits the 
display to those bars having a total coefficient of determination at least as high as the 
threshold specified. To assist in visual presentation, the bars can grow or shrink in 
height depending on how many axe presented. For example, if very few bars are 
presented, they become taller and the colors are thus more easily distinguishable on the 
10 display. 

When bars for plural targets are displayed, the bars can be ordered based on 
predicted gene first and length second. For example, the set of bars relating to a first 
particular predicted gene are presented in order of length at the top, followed 
underneath by a set of bars relating to a second particular predicted gene in order of 

1 5 length, and so forth. If the predicted genes are ordered according to the corresponding 
color's frequency, the bar segments representing the predicted genes (e.g., 2522A) can 
be ordered to produce a rainbow effect. Such a presentation assists in differentiation of 
the colors and identification within the legend. This is sometimes helpful because a 
presentation of plural targets tends to have very many bars, which can visually 

20 overwhelm the user. 

In the case of a predictive element set of size three, a cubical graph can portray 
the observed data for each predictive element. For example, FIG. 26 shows data 
observed for the PC-1, RCHl, and p53, which were used to predict the fictitious gene 
AHA. The presence and size of spheres appearing at various coordinates indicate the 

25 number of observations for a particular triple in the data of Table 4. The color of the 
sphere indicates whether expression of AHA was up-regulated (red), unchanged 
(yellow), or down-regulated (green). The percentage of observations falling within the 
triple is also displayed. 

For example, sphere 2622 indicates that a particular triple (PC-1 = +1, 

30 RCHl = -1, and p53 = -1) was observed three times. The values in parentheses indicate 
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the number of times down-regulation, unchanged, or up-regulation were observed, 
respectively, for this triple (3 dovm-regulations were observed). The sphere is green 
because most (i.e., all 3) of the observations were dovra-regulations. 10% of the total 
observations relating to the AHA gene were of the triple (+1, -1, -1), so "0.10" is 

5 displayed proximate the sphere. 

The cubical graph shows the degree to which the target values tend to separate 
the predictive elements. Actual counts are also shown, along v^th the fraction of time a 
triple appears in the observations. The cubical graph can be rotated and viewed from 
different angles. An object other than a sphere can be used. 

10 In addition, the perceptron predictor for a particular set of predictive elements 

can be displayed as a cube, with planes defining the perceptron depicted thereon. For 
example, FIG. 27 portrays a perceptron for predicting the gene "AHA." Values 
provided by the perceptron are indicated on the cube by depicting spheres of various 
colors. If up-regulation is predicted, a red sphere appears; if down-regulation is 

1 5 predicted, a green sphere appears; otherwise a yellow sphere appears. 

A set of two planes defines the perceptron and will separate the groups of 
colored spheres into three regions, where each region contains spheres of the same 
color. The display can be rotated. 

In some cases, it may be desirable to construct a hardware circuit implementing 

20 a model. For example, in a scenario involving a large number of predictive elements 
and predicted genes, a set of specialized circuits could be constructed to predict gene 
expression via nonlinear models. Effectiveness of these circuits can be measured to 
quantify relatedness between the predictive elements and the predicted gene. Such an 
approach may be superior to a software-based approach. 

25 Considerations conceming constraint of the models may arise due to the choice 

of hardware. For example, it may be particularly efficient to construct the hardware 
from a limited set of logic or a limited number of logic circuits. By limiting the circuit 
to a constrained class of hardware, a constrained logic predictor resuhs. Thus, error 
relating to estimation of the coefficient of determination may be avoided. 
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For example, the software can generate a logical circuit implementation of the 
perceptron discussed above. A comparator-based logic architecture based on the signal 
representation theory of mathematical morphology can be used. {See Dougherty and 
Barrera, "Computational Gray-Scale Operators," Nonlinear Filters for Image 
5 Processing, pp, 61-98, Dougherty and Astola eds., SPIE and IEEE Presses, 1999.) 

The expression levels of the predictive elements are input in parallel to two 
banks of comparators, each of which is an integrated circuit of logic gates and each 
being denoted by a triangle having two terminals. One terminal receives the input 
(Xi, X2, X3) and the other has a fixed vector input (tj, tj, tg). If (tj, X^^ is at the upper 

10 terminal, then the comparator outputs 1 if x^ < t^, X2 < X2 and X3 < tg; otherwise, it 
outputs 0. If (ti, t2, t3) is at the lower terminal, then the comparator outputs 1 if 
Xj > tj, X2 > t2 and X3 > X^\ otherwise, it outputs 0. The outputs of each comparator pair 
enter an AND gate that outputs 1 if and only if the inputs are between the upper and 
lower bounds of the comparator pair. The AND outputs in the upper bank enter an OR 

1 5 gate, as do the AND outputs of the lower bank. The outputs of the OR gates enter a 
multiplexer. There are three possibilities: (1) both OR gates output 0, in which the 
multiplexer (and hence the circuit) outputs 0; (2) the upper OR gate outputs 1 and the 
lower OR gate outputs 0, in which case the multiplexer outputs -1 ; (3) the upper OR 
gate outputs 0 and the lower OR gate outputs 1 , in which case the multiplexer outputs 

20 1, FIG. 28 shows a logic implementation for the perceptron shown in FIG. 27. 

A variety of other representations of a model can be used. For example, a 
decision tree is one way of representing the model. In some cases, there will be 
multiple equivalent ways to define the same model. Logic representations are useful 
because they sometimes result in faster analysis (e.g., when implemented in specialized 

25 hardware). 

Exemplary Computer System for Conducting Analysis 
Figure 29 and the follov^ng discussion are intended to provide a brief, general 
description of a suitable computing environment for the computer programs described 
above. The method for quantifying gene relatedness is implemented in computer- 
30 executable instructions organized in program modules. The program modules include 
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the routines, programs, objects, components, and data structures that perform the tasks 
and implement the data types for implementing the techniques described above. 

While Fig, 29 shows a typical configuration of a desktop computer, the 
invention may be implemented in other computer system configurations, including 
5 multiprocessor systems, microprocessor-based or programmable consumer electronics, 
minicomputers, mainframe computers, and the like. The invention may also be used in 
distributed computing environments where tasks are performed in parallel by 
processing devices to enhance performance. For example, tasks related to measuring 
the effectiveness of a large set of nonlinear models can be performed simultaneously on 

10 multiple computers, multiple processors in a single computer, or both. In a distributed 
computing environment, program modules may be located in both local and remote 
memory storage devices. 

The computer system shown in Fig. 29 is suitable for carrying out the invention 
and includes a computer 2920, with a processing unit 2921, a system memory 2922, 

1 5 and a system bus 2923 that interconnects various system components, including the 
system memory to the processing unit 2921 . The system bus may comprise any of 
several types of bus structures including a memory bus or memory controller, a 
peripheral bus, and a local bus using a bus architecture. The system memory includes 
read only memory (ROM) 2924 and random access memory (RAM) 2925, A 

20 nonvolatile system 2926 (e.g., BIOS) can be stored in ROM 2924 and contains the 
basic routines for transferring information between elements within the personal 
computer 2920, such as during start-up. The personal computer 2920 can further 
include a hard disk drive 2927, a magnetic disk drive 2928, e.g., to read from or write 
to a removable disk 2929, and an optical disk drive 2930, e.g., for reading a CD-ROM 

25 disk 293 1 or to read from or write to other optical media. The hard disk drive 2927, 
magnetic disk drive 2928, and optical disk drive 2930 are connected to the system bus 
2923 by a hard disk drive interface 2932, a magnetic disk drive interface 2933, and an 
optical drive interface 2934, respectively. The drives and their associated computer- 
readable media provide nonvolatile storage of data, data structures, computer- 

30 executable instructions (including program code such as dynamic link libraries and 
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executable files), and the like for the personal computer 2920. Although the description 
of computer-readable media above refers to a hard disk, a removable magnetic disk, 
and a CD, it can also include other types of media that are readable by a computer, such 
as magnetic cassettes, flash memory cards, digital video disks, and the like. 
5 A number of program modules may be stored in the drives and RAM 2925, 

including an operating system 2935, one or more application programs 2936, other 
program modules 2937, and program data 2938. A user may enter commands and 
information into the personal computer 2920 through a keyboard 2940 and pointing 
device, such as a mouse 2942. Other input devices (not shown) may include a 

10 microphone, joystick, game pad, satellite dish, scanner, or the like. These and other 
input devices are often connected to the processing unit 2921 through a serial port 
interface 2946 that is coupled to the system bus, but may be connected by other 
interfaces, such as a parallel port, game port, or a universal serial bus (USB). A 
monitor 2947 or other type of display device is also connected to the system bus 2923 

15 via an interface, such as a display controller or video adapter 2948, In addition to the 
monitor, personal computers typically include other peripheral output devices (not 
shown), such as speakers and printers. 

The above computer system is provided merely as an example. The invention 
can be carried out in a wide variety of other configurations. Further, a wide variety of 

20 approaches for collecting and analyzing data related to quantifying gene relatedness is 
possible. For example, the data can be collected, nonlinear models built, the models' 
effectiveness measured, and the results presented on different computer systems as 
appropriate. In addition, various software aspects can be implemented in hardware, and 
vice versa. 

25 Having illustrated and described the principles of the invention in exemplary 

embodiments, it should be apparent to those skilled in the art that the illustrative 
embodiments can be modified in arrangement and detail without departing from such 
principles. Although high throughput cDNA technology is presented as an example, 
any number of other art-known techniques can be used to measure gene expression, 

30 including for instance quantitative and semi-quantitative nucleotide amplification (e.g.. 
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PGR) techniques. Another approach that can be used to detect differential gene 
expression is a tissue microarray, as disclosed for example in PCX publications WO 
99/44063 and WO 99/44062. Various other techniques are described in the following 
references, which are hereby incorporated herein by reference: U.S. Patent No. 
5 5,994,076 to Chenchik et al., entitled "Methods of assaying differential expression," 
filed May 21, 1997; U.S. Patent No, 6,059,561 to Becker, entitled "Compositions and 
methods for detecting and quantifying biological samples," filed June 9, 1998; Tewary 
et aL, "Qualitative and quantitative measurements of oligonucleotides in gene therapy: 
Part I. In vitro models," J Pharm Biomed Anal, 15:857-73, April 1997; Tewary et aL, 

10 "Qualitative and quantitative measurements of oligonucleotides in gene therapy: Part II 
in vivo models," J Pharm BiomedAnal, 15:1 127-35, May 1997; Komminoth et al, "In 
situ polymerase chain reaction: general methodology and recent advances," Verh Dtsch 
Ges Pathol^ 78:146-52, 1994; Bell et al., "The polymerase chain reaction," Immunol 
Today, 10:351-5, October 1989. Further, the principles of the invention can be applied 

15 to protein-based measurements (e.g., comparison of differential protein expression) 
without regard to the level of expressed nucleotides. In view of the many possible 
embodiments to which the principles of the invention may be applied, it should be 
understood that the illustrative embodiments are intended to teach these principles and 
are not intended to be a limitation on the scope of the invention. We therefore claim as 

20 our invention all that comes within the scope and spirit of the following claims and 
their equivalents. 



40 



E-059-00/0 06/15/00 4239-54279/WDN/GLM 



EXPRESS MAIL EM126586308US 



We claim: 

1 . A computer-implemented method for quantifying gene relatedness for a 
plurality of candidate genes for which a plurality of gene expression level observations 
have been collected, the method comprising: 

5 based on data comprising the plurality of gene expression level observations for 

the plurality of candidate genes, constructing a nonlinear model predicting gene 
expression; 

predicting gene expression v^ith the nonlinear model; and 
measuring effectiveness of the nonlinear model in predicting gene expression, 
10 thereby quantifying gene relatedness for the plurality of candidate genes. 

2. A computer-readable medium comprising computer-readable 
instructions for performing the method of claim 1 . 

15 3. The method of claim 1 wherein the nonlinear model accepts a plurality 

of predictive elements as inputs, wherein at least one of the predictive elements 
indicates whether a gene expression observation is associated with having applied a 
particular external stimulus to biological material. 

20 4. The method of claim 1 wherein the nonlinear model accepts a plurality 

of predictive elements as inputs, wherein at least one of the predictive elements 
indicates whether a gene expression observation is associated with a particular cell 
state. 

25 5. The method of claim 1 wherein the nonlinear model accepts a plurality 

of predictive elements as inputs, wherein at least one of the predictive elements 
indicates differential gene expression between two samples of biological material. 
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6. The method of claim 1 wherein the nonUnear model comprises a 
multivariate prediction function accepting two or more inputs to predict gene 
expression, 

5 7. The method of claim 1 wherein measuring effectiveness of the nonlinear 

model comprises comparing observed gene expression to gene expression predicted by 
the nonlinear model. 

8. The method of claim 1 wherein constructing the nonlinear model 

1 0 predicting gene expression comprises choosing a nonlinear model from a constrained 
set of nonlinear models. 

9. The method of claim 1 wherein measuring the model's effectiveness 
comprises evaluating the model to estimate a coefficient of determination for an 

15 optimal model estimated by the model. 

10. The method of claim 1 wherein the nonlinear model predicting gene 
expression is a full-logic model predicting gene expression for a predicted candidate 
gene, and the effectiveness of the model is measured by comparing predictions of gene 

20 expression for the predicted candidate gene by the model with observations of gene 
expression for the predicted candidate gene. 

1 1 . The method of claim 1 further comprising: 

obtaining the data comprising the plurality of gene expression level 
25 observations from results of a plurality of cDNA microarray experiments measuring 
mRNA transcription levels for a plurality of genes in biological material. 
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12. The method of claim 1 wherein the data comprising a plurality of gene 
expression observations is divided into a training set of data and a test set of data, 
wherein 

the nonlinear model predicting gene expression is generated via the training set 
5 data; and 

effectiveness of the nonlinear model is measured via the test set of data. 

13. The method of claim 12 wherein the training set of data is extended by 
randomly reordering and recycling gene expression observations. 

10 

14. The method of claim 1 wherein 

a plurality of training data sets are repeatedly chosen from the data comprising a 
plurality of gene expression observations; 

the nonlinear model is one of a plurality of models constructed from the 
1 5 plurality of training data sets; and 

quantification of the relatedness for the plurality of candidate genes is measured 
by measuring average effectiveness of the plurality of models constructed from the 
plurality of training data sets. 



20 15. The method of claim 1 further comprising: 

to determine contribution of a predictive element to the quantification of 
relatedness, constructing an additional nonlinear model predicting gene expression, 
wherein the additional nonlinear model has a single input. 



25 1 6. The method of claim 1 wherein the nonlinear model predicting gene 

expression is a truth table predicting a gene expression level for a predicted candidate 
gene from predictive elements comprising expression level observations for candidate 
genes other than the predicted candidate gene. 
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17. The method of claim 1 6 wherein the truth table comprises ternary 
discrete values. 

1 8. The method of claim 1 6 wherein gene expression levels in the truth table 
5 are ternary discrete values. 



19. The method of claim 16 wherein 

the truth table comprises a plurality of rows for possible combinations of 
expression level observations for the candidate genes other than the predicted candidate 
1 0 gene; and 

for at least one of the rows, the truth table indicates predicted gene expression 
for the predicted candidate gene with a thresholded weighted average of gene 
expression level observations associated with the row. 

15 20. The method of claim 1 wherein the nonlinear model predicting gene 

expression is a neural network predicting gene expression. 

21 . The method of claim 20 wherein the neural network consists of one 
neuron which predicts a gene expression level for a single predicted candidate gene. 

20 

22. The method of claim 21 wherein the neuron is a ternary perceptron 
accepting predictive elements as inputs, wherein the predictive elements comprise gene 
expression levels indicated as one of three possible values: up, unchanged, and down. 



25 23 . The method of claim 22 further comprising: 

displaying a three-dimensional graph representing the ternary perceptron with 
two planes separating points on the graph into points relating to like predicted values. 
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24. The method of claim 22 further comprising: 

displaying a three-dimensional graph representing the ternary perceptron with 
objects at points in three-dimensional space within the graph, wherein axes of the graph 
relate to thresholded gene expression levels for three of the candidate genes. 

25. The method of claim 24 wherein the objects are of a color indicating a 
predicted gene expression level. 



26. The method of claim 24 wherein the objects are of a size indicating a 
] 0 number of observations related to a point on the graph. 



27. The method of claim 1 wherein the data comprising a plurality of gene 
expression observations comprises gene expression level observations generated by 
subjecting sample biological material to an experimental condition and observing 
] 5 regulation of mRNA transcription levels for a plurality of genes in the biological 
material as a result of being subjected to the experimental condition. 



28. The method of claim 27 wherein the data comprising a plurality of gene 
expression observations further comprises an indication of the experimental condition 
20 to which the biological material related to an observation was subjected and the 
indication is included in the model to predict gene expression. 
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29. A computer-implemented method for identifying genes related to a 
target gene by analyzing gene expression level observations for the genes, the method 
comprising: 

based on the gene expression level observations, constructing multivariate 
5 nonlinear predictors that predict an expression level for the target gene, wherein the 
predictors accept gene expression levels for other genes as predictive elements; 

estimating a coefficient of determination for sets of predictive elements and the 
target gene by comparing results of the multivariate nonlinear predictors with gene 
expression level observations for the target gene, wherein the predictive elements 
1 0 comprise expression level observations for genes other than the target gene; and 

ranking the groups of genes other than the target gene by coefficient of 
determination to present the genes other than the target gene in order of likelihood of 
relatedness to the target gene. 



1 5 30, The method of claim 29 further comprising: 

indicating a proper subset of the genes having the highest likelihood of 
relatedness to the target gene. 



31. A computer-implemented method for analyzing gene expression level 
20 observations for a set of genes comprising a target gene, the method comprising: 

estimating a coefficient of determination for an optimal multivariate nonlinear 
model predicting gene expression of the target gene by constructing a multivariate 
nonlinear model from the gene expression level observations of gene expression for the 
target gene, wherein the optimal multivariate nonlinear model and the constructed 
25 multivariate nonlinear model predict gene expression of the target gene based on 
variables representing gene expression levels of genes other than the target gene. 
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32. The method of claim 3 1 wherein the optimal multivariate nonlinear 
model and the constructed multivariate nonlinear model predict gene expression based, 
at least in part, on inputs comprising an indication of a condition to which biological 
material relating to the observations has been subjected. 

5 

33. A method for identifying related genes out of a set of genes for which 
gene expression level observations have been collected, the method comprising: 

for at least one predicted gene out of the set of genes, training an artificial 
intelligence function to predict gene expression for the predicted gene, wherein the 
10 artificial intelligence function takes one or more predictive elements as inputs and 
produces a gene expression level for the predicted gene as an output, wherein at least 
one of the predictive elements is a gene expression level for a gene other than the 
predicted gene; and 

testing effectiveness of the artificial intelligence function in predicting 
1 5 expression of the predicted gene to rate relatedness of the predicted gene and at least 
one gene associated with the predictive elements. 

34. The method of claim 33 wherein the artificial intelligence function takes 
a plurality of predictive elements as inputs. 

20 

35. The method of claim 33 wherein the predictive elements comprise a 
variable indicating biological material was subjected to an experimental condition. 
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36. For a pliirality of observed genes for which expression levels have been 
observed, a method of presenting an analysis of the expression levels to assist in 
identifying related genes, the method comprising: 

denoting a particular observed gene as a predicted gene; 

for the predicted gene, constructing a plurality of nonlinear multivariate models 
predicting expression of the observed gene, wherein the nonlinear multivariate models 
comprise a variety of predictive elements chosen from permutations of expression 
levels of observed genes other than the predicted gene; 

measuring effectiveness of the nonlinear multivariate models in predicting 
expression of the predicted gene to quantify relatedness between the predicted gene and 
the set of genes associated with the predictive elements of the models; and 

presenting a quantification of relatedness between the predicted gene and a set 
of genes associated with the predictive elements of at least one of the models. 

37. The method of claim 36 further comprising: 

for a set of predictive elements and a predicted gene, displaying a graph 
indicating the amount of increase in the effectiveness of the model for each of the 
predictive elements. 

38. The method of claim 36 wherein at least two of the plurality of nonlinear 
multivariate models predicting expression of the observed gene are implemented in 
specialized hardware circuits for predicting gene expression. 

39. The method of claim 36 further comprising: 

displaying a user interface for evaluating the analysis, wherein the user interface 
comprises display elements graphically indicating the relatedness of the predicted gene 
to a plurality of gene sets. 
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40. The method of claim 39 further comprising: 

displaying only those display elements indicating sets of genes in which each 
gene in the set improves the relatedness. 

4 1 . The method of claim 39 further comprising: 

accepting as input a set of one or more designated predictor genes; 
accepting as input a threshold relatedness; 

accepting as input a set of one or more designated predicted genes; and 
limiting the display elements of the user interface to those sets of genes having 
as members the one or more designated predictor genes and having at least the 
threshold relatedness for the one or more designated predicted genes. 

42. The method of claim 39 further comprising: 

accepting as input a set of one or more designated predictor genes; and 
limiting display elements of the user interface to those sets of genes having the 
one or more designated predictor genes. 

43. The method of claim 42 further comprising: 
accepting as input a threshold increase in relatedness; and 

further limiting display elements of the user interface to those sets of genes for 
which addition of the one or more designated predictor genes increases the relatedness 
by at least the threshold increase in relatedness. 

44. The method of claim 36 further comprising presenting a ranking of gene 
sets according to their relatedness, wherein the ranking indicates which genes are in the 
sets. 

45. The method of claim 44 wherein the ranking further indicates 
contribution of individual predictive elements to the effectiveness of the models. 
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46. The method of claim 36 wherein the predictive elements comprise a 
variable indicative of whether biological material related to a gene expression 
observation has been subjected to a particular condition. 



47, For a plurality of observed genes for which expression levels have been 
observed, a method of performing an analysis of the expression levels to assist in 
identifying related genes, the method comprising: 

(a) for a plurality of the observed genes, denoting a particular observed gene as 
a predicted gene and performing at least (b) and (c); 

(b) for the predicted gene, constructing a plurality of nonlinear multivariate 
models predicting expression of the predicted gene, wherein the nonlinear multivariate 
models have a variety of predictive elements chosen from permutations of expression 
levels of observed genes other than the predicted gene; 

(c) measuring effectiveness of the nonlinear multivariate models in predicting 
expression of the predicted gene to provide a quantification of relatedness between the 
predicted gene and genes associated with the predictive elements of the models. 

48, The method of claim 47 further comprising: 

skipping designating genes having fewer than a defined number of changes in 
expression level as predicted genes. 

49, The method of claim 47 further comprising: 

displaying a user interface comprising display elements indicating gene 
relatedness for a plurality of genes associated with the predictive elements for a 
plurality of predicted genes. 
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50. A system for quantifying gene relatedness for a plurality of candidate 
genes for which a plurality of gene expression level observations have been collected, 
the system comprising: 

means for constructing a nonlinear model predicting gene expression based on 
5 data comprising the plurality of gene expression level observations for the plurality of 
candidate genes; 

means for predicting gene expression with the nonlinear model; and 
means for measuring effectiveness of the nonlinear model in predicting gene 
expression, thereby quantifying gene relatedness for the plurality of candidate genes. 

10 

5 1 . The method of claim 50 wherein the means for predicting gene 
expression is a specialized hardware circuit. 

52. The method of claim 50 wherein the means for predicting gene 
1 5 expression is a decision tree. 

53. The method of claim 50 wherein the means for predicting gene 
expression is a truth table chosen from a constrained set of truth tables. 

20 54. A system for quantifying the relatedness of a set of genes, the system 

comprising: 

a multivariate nonlinear predictor constructor operable to construct a 
multivariate nonlinear predictor based on gene expression level observations for a 
plurality of candidate genes; and 
25 a multivariate nonlinear predictor tester operable to test the effectiveness of the 

multivariate nonlinear predictor to quantify relatedness for the plurality of candidate 
genes. 
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55. A computer user interface system for presenting results of a nonlinear 
multivariate prediction analysis of gene expression level data, the computer user 
interface system comprising: 

a plurality of display elements, wherein each display element indicates a set of 
5 genes and a coefficient of determination of the set of genes for a target gene. 

56. The computer user interface system of claim 55 wherein the plurality of 
display elements indicate a set of genes by displaying a different color for each gene in 
the set of genes. 

10 

57. The computer user interface system of claim 56 wherein the plurality of 
display elements indicate a set of genes by displaying a bar segment of a different color 
for each gene in the set of genes and the bar segment is of a size indicating contribution 
to the coefficient of determination by each gene in the set of genes. 

15 

58. The computer user interface system of claim 57 further comprising a 
contiguous display region for denoting a target gene associated with sets of genes, 
wherein the target gene is denoted with a color for the target gene and the sets of genes 
are grouped by target gene. 

20 

59. The computer user interface system of claim 55 wherein the user 
interface system accepts a threshold coefficient of determination to limit the display to 
sets of genes having at least the threshold coefficient of determination for a target gene. 

25 60. The computer user interface system of claim 55 wherein the user 

interface accepts a threshold increase in coefficient of determination to limit the display 
to sets of genes having at least one gene whose inclusion in the set increases the 
coefficient of determination by at least the threshold increase in coefficient of 
determination. 

30 
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61 . A computer-implemented method of ranking the relatedness of a 
plurahty of genes based on gene expression level observations associated with the 
plurality of genes^ the method comprising: 

based on the gene expression level observations, constructing a plurality of 
5 multivariate nonlinear predictors to predict the expression of a plurality of target genes 
out of the genes, wherein the multivariate nonlinear predictors comprise predictive 
elements comprising an observed gene, thereby associating the multivariate nonlinear 
predictor with the target gene and at least one observed gene; 

testing effectiveness of the plurality of multivariate nonlinear predictors in 
10 predicting gene expression to quantify gene relatedness between the genes associated 
with the predictors by estimating a coefficient of determination; and 

displaying a ranked list of gene relatedness among the genes as determined by 
testing the plurality of multivariate nonlinear predictors. 



53 



E-059-00/0 06/1 5/00 4239-54279/WDN/GLM 



EXPRESS MAIL EM126586308US 



ABSTRACT 

Relatedness between genes is quantified by constructing nonlinear models 
predicting gene expression. Effectiveness of the model is evaluated to provide a 
5 measurement of the relatedness of genes associated with the model. Various types of 
models, including fall-logic or neural networks can be constructed. A graphical user 
interface presents results of the analysis to allow evaluation by a user. Each gene's 
contribution to the measurement of relatedness can be shown on a graph, and graphical 
representations of models used to predict gene expression can be displayed. 

10 
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COMBINED DECLARATION AND POWER OF ATTORNEY 
FOR PATENT APPLICATION 

As a below named inventor, I hereby declare that: 

My residence, post office address and citizenship are as stated below next to my name, 

I believe I am the original, first and sole inventor (if only one name is listed below) or an original, 
first and jomt inventor (if plural names are listed below) of the subject matter which is claimed and for which a 
patent is sought on the invention entitled QUANITIFYING GENE RELATEDNESS VIA NONLINEAR 
PREDICTION OF GENE EXPRESSION LEVELS, the specification of which 

^ is attached hereto. 

I I was filed on as Application No. . 

I I was described and claimed in PCX International Application No. , filed on , and as 

amended under PCX Article 19 on (if applicable). 

I I and was amended on (if applicable). 

[~| with amendments through (if applicable). 

I hereby state that I have reviewed and understand the contents of the above-identified specification, 
including the claims, as amended by any amendment referred to above. 

I acknowledge the duty to disclose information which is material to patentability as defined in Xitle 
37, Code of Federal Regulations, ' 1.56. If this is a continuation-in-part application filed under the conditions 
specified in 35 U.S.C. § 120 which discloses and claims subject matter in addition to that disclosed in the prior 
copendhig application, I further acknowledge the duty to disclose material information as defmed in 37 CFR 
§ 1.56 which occurred between the filiag date of the prior application and the national or PCX international 
filing date of the continuation-in-part application. 

I hereby claim foreign priority benefits under Xitle 35, United States Code, § 1 19(a)-(d) of any 
foreign application(s) for patent or inventor's certificate or of an PCX International application(s) designating at 
least one coimtry other than the United States of America listed below and have also identified below any 
foreign application(s) for patent or inventor's certificate or any PCX International application(s) designating at 
least one country other than the United States of America filed by me on the same subject matter having a filing 
date before that of the application(s) on which priority is claimed: 



Prior Foreign Application(s) 




Priority Claimed 


« Number » 


« Coimtry » 


« Day/Month/Year filed 

» 


□ Yes □ No 



I hereby claim the benefit under Xitle 35, United States Code, § 1 19(e) of any United States 
provisional application(s) listed below: 







(Application No.) 


(Filing Date) 



I hereby claim the benefit under Xitle 35, United States Code, § 120 of any United States 
application(s) or § 365(c) of any PCX International application(s) designating the United States, listed below 
and, insofar as the subject matter of each of the claims of this application is not disclosed in the prior United 
States or PCX International application in the manner provided by the first paragraph of Xitle 35, United States 
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Code, § 1 12, 1 acknowledge the duty to disclose material information as defined in Title 37, Code of Federal 
Regulations, § 1, 56(a) which occurred between the filing date of the prior application and the national or PCT 
International filing date of this application: 









(Apphcation No.) 


(Filing Date) 


(Status: patented, pending, abandoned) 



The undersigned hereby authorizes the U.S. attorney or agent named herein to accept and follow 

instructions from as to any action to be taken in the Patent and Trademark Office regarding this 

application without direct communication between the U.S. attorney or agent and the undersigned. In the event 
of a change in the persons from whom instructions may be taken, the U.S. attorney or agent named herein will 
be so notified by the undersigned. 

I hereby appoint the following attomey(s) and/or agent(s) to prosecute this application, to file a 
corresponding international application, and to transact all business in the Patent and Trademark Office 
connected therewith: 



Name 


Reg, No. 


Name 


Reg. No. 


James C. Haight 


25,588 


Jack Spiegel 


34,477 


Robert Benson 


33,612 


David R. Sadowski 


32,808 


Susan S. Rucker 


35,762 


Steven Ferguson 


38,448 


John Peter Kim 


38,514 


Joseph K. Hemby 


42,652 


Stephen L. Finley 


36,357 


Marlene Shinn 


46,005 


Richard R. Rodriguez 


45,980 






.ssociate Power of Attorney to the following: 






Name 


Reg. No. 


Name 


Reg. No. 


Richard J. PoUey 


28,107 


Lisa M. Caldwell 


41,653 


William D. Noonan 


30,878 


Michael D. Jones 


41,879 


Mark L. Becker 


31,325 


Tanya M. Harding 


42,630 


Donald L. Stephens Jr. 


34,022 


Gregory L. Maurer 


43,781 


Stacey C. Slater 


36,011 


Paula A. DeGrandis 


43,581 


Michelle L. Johnson 


36,352 


Samuel E.George 


44,119 


Susan Alpert Siegel 


43,121 







all of the law firm of Klarquist Sparkman Campbell Leigh & Whinston, LLP. 

Address all telephone calls to Gregory L. Maurer, telephone number 503/226-7391 and facsimile 
number 503/228-9446. 

Address all correspondence to: 

KLARQUIST SPARKMAN CAMPBELL 

LEIGH & WHINSTON, LLP 

One World Trade Center, Suite 1600 

121 SW Salmon Street 

Portland, OR 97204-2988 

I hereby declare that all statements made herein of my own knowledge are true and that all 
statements made on information and belief are believed to be true; and further that these statements were made 
with the knowledge that willful false statements and the like so made are punishable by fine or imprisonment, or 
both, under Section 1001 of Title 18 of the United States Code and that such willful false statements may 
jeopardize the validity of the application or any patent issued thereon. 
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Full Name of Sole or first Inventor: Edward R. Dougherty 

Inventor's Signature 

Date 

Residence: College Station, TX 
Citizenship: United States of America 

Post Office Address: 3101 Westchester Ave., College Station, TX 77845 

Full Name of Second Joint Inventor, if any: Seungchan Kim 

Inventor's Signature 

Date 

Residence: College Station, TX 
Citizenship: Korea (R.O.K.; South Korea) 

Post Office Address: 1 Hensel Drive #X3L, College Station, TX 77840 

Full Name of Third Joint Inventor, if any: Michael L. Bittner 

Inventor's Signature 

Date 

Residence: Rockville, MD 
Citizenship: United States of America 

Post Office Address: 1 5 Lodge Place, Rockville, MD 20850 

Full Name of Fourth Joint Inventor, if any: Yidong Chen 

Inventor's Signature 

Date 

Residence: Rockville, MD 
Citizenship: Peoples Republic of China 

Post Office Address: 2888 Bahnoral Drive, Rockville, MD 20850 
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Full Name of Fifth Joint Inventor, if any: Krishnamoorthy Sivakumar 

Inventor's Signature 

Date 

Residence: Pullman, WA 
Citizenship: India 

Post Office Address: 1615 SE Bleasner Drive, #59, Pullman, WA 99163 
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